Conversation
yeandy
left a comment
I'm wondering how many of these files we need?
- primus/backends/maxtext/input_pipeline/_hf_data_processing.py
- primus/backends/maxtext/input_pipeline/custom_packed_batch.py (I see this is deleted)
- primus/backends/maxtext/layers/attention_op.py
- primus/backends/maxtext/layers/attentions.py (I see this is deleted)
- primus/backends/maxtext/metric_logger.py
- primus/backends/maxtext/train.py
- primus/backends/maxtext/train_utils.py
I think they were added in the past for the purposes of patching. @amd-fuyuajin do you know if these are getting patched into the MaxText codebase when you run the training? Even if it is, it might be the same code as what is found in rocm/jax-training:maxtext-v26.1 actually. @llying-001 might know best.
I updated these files in the Primus repo to stay aligned with the `yeandy/update-patches-scaling-patch-v2-checkpoint-restore` branch in ROCm/maxtext.
- Add a timestamp to log filenames to prevent overwriting across runs
- Move tee logging outside the inline script to capture consolidated multi-node output in a single log file
- Make `--nodelist` conditional via the `NODE_LIST` env variable
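A minimal sketch of the launcher changes described above; the variable names (`LOG_DIR`, `SRUN_OPTS`) and the `echo` stand-in for the real `srun` invocation are illustrative, not the actual script contents:

```shell
#!/usr/bin/env bash
# Illustrative sketch only: shows timestamped log names, a conditional
# --nodelist, and tee-ing outside the inline script into one file.

LOG_DIR=${LOG_DIR:-./logs}
mkdir -p "$LOG_DIR"

# Timestamp each run's log so repeated launches do not overwrite it.
TS=$(date +%Y%m%d_%H%M%S)
LOG_FILE="$LOG_DIR/pretrain_${TS}.log"

# Pass --nodelist only when the NODE_LIST env variable is set.
SRUN_OPTS=(-N 2)
if [ -n "${NODE_LIST:-}" ]; then
  SRUN_OPTS+=(--nodelist "$NODE_LIST")
fi

# Tee outside the inline script so output from all nodes lands in one file.
echo "would run: srun ${SRUN_OPTS[*]} train.sh" 2>&1 | tee "$LOG_FILE"
```

Keeping the `tee` on the outer command (rather than inside the inline per-node script) is what consolidates multi-node output into a single file.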
- Set `TF_CPP_MIN_LOG_LEVEL=2`. Without this setting, an error occurs at the end when all training steps complete.
- `XLA_FLAGS` is case sensitive. Corrected a few values.
- Detect the backend framework in `primus-cli-direct.sh` and install the JAX dependencies accordingly
- If using AINIC (setting `USING_AINIC=1`), `03_enable_ainic.sh` will run. `LD_LIBRARY_PATH` is modified to make sure libraries are correctly loaded for JAX/MaxText.
- Set `XLA_PYTHON_CLIENT_MEM_FRACTION=.93` to avoid the `HSA_STATUS_ERROR_OUT_OF_RESOURCES` error during multi-node training
- Corrected some `XLA_FLAGS` values. They are case sensitive: `true` and `false` must not be capitalized.
- Set `TF_CPP_MIN_LOG_LEVEL=2` to suppress the error messages at the end of JAX/MaxText training

Here is an example of launching JAX/MaxText training on two nodes:
`./primus-cli --config runner/maxtext-test.yaml slurm srun -N 2 -- train pretrain --config examples/maxtext/configs/MI355X/llama2_7B-pretrain.yaml`
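The environment settings above can be collected into a snippet like the following. The values come from this PR's description; the `XLA_FLAGS` flag shown is only an example of the lowercase-boolean convention, not necessarily one of the flags this PR changed:

```shell
# Illustrative environment setup for multi-node JAX/MaxText runs.

# Leave headroom in the device heap to avoid
# HSA_STATUS_ERROR_OUT_OF_RESOURCES during multi-node training.
export XLA_PYTHON_CLIENT_MEM_FRACTION=.93

# Suppress the error spew emitted when all training steps complete.
export TF_CPP_MIN_LOG_LEVEL=2

# XLA_FLAGS values are case sensitive: write true/false in lowercase,
# never True/False. (Flag name here is an example only.)
export XLA_FLAGS="--xla_gpu_enable_latency_hiding_scheduler=true"
```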
Problem: when running `apt install linux-headers-"$(uname -r)"`, the package name resolved to the wrong version number on some nodes and caused a "package not found" error. Solution: remove it from the package install list. This does not affect performance.
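If the headers package ever needs to come back, one defensive alternative to dropping it outright is to guard the install on the package actually being resolvable. This is a hedged sketch, not the change made in this PR (which simply removes the package); `PKGS` is an illustrative package list:

```shell
# Sketch: only add linux-headers-$(uname -r) to the install list when apt
# can actually resolve it, instead of failing the whole install on nodes
# where the running kernel has no matching headers package.
PKGS=(build-essential)   # illustrative base list

HDR="linux-headers-$(uname -r)"
if apt-cache show "$HDR" >/dev/null 2>&1; then
  PKGS+=("$HDR")
else
  echo "skipping $HDR: not available in apt cache"
fi
echo "would install: ${PKGS[*]}"
```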
1. Added examples for using AINIC in training
2. Added more examples for running preflight
3. Updated the argument format for the benchmark gemm command. The script had been changed, but the document was not updated.
Force-pushed from 2e31891 to 095b267
Pull request overview
This PR adds comprehensive support for JAX/MaxText backend testing and multi-node training capabilities, including AINIC network integration, improved checkpointing, and various model architecture enhancements.
Changes:
- Updated MaxText submodule to a newer commit
- Added AINIC configuration support with proper environment variable setup and library path ordering
- Enhanced MaxText backend with improved checkpointing, attention mechanisms, and decoder layer implementations
- Refactored dependency installation to detect framework type and install appropriate requirements
Reviewed changes
Copilot reviewed 34 out of 35 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| third_party/maxtext | Updated MaxText submodule reference to newer commit |
| runner/use_ainic.yaml | New configuration file for AINIC network setup with container options |
| runner/primus-cli-direct.sh | Added framework detection logic to install correct dependencies (JAX vs PyTorch) |
| runner/helpers/hooks/train/pretrain/maxtext/prepare.py | Removed problematic linux-headers package, adjusted memory limits and XLA flags |
| runner/helpers/hooks/03_enable_ainic.sh | Fixed LD_LIBRARY_PATH ordering to append instead of prepend paths |
| runner/.primus.yaml | Uncommented InfiniBand device for AINIC support |
| requirements-jax.txt | Simplified to core dependencies only |
| primus/pretrain.py | Enhanced MaxText path detection to support src subdirectory |
| primus/modules/trainer/maxtext/pre_trainer.py | Extended patching to include initialization, checkpointing, config types, and decoder layers |
| primus/configs/modules/maxtext/trainer_base.yaml | Updated configuration with new parameters and removed deprecated options |
| primus/configs/models/maxtext/llama3.1_405B.yaml | New model configuration for Llama 3.1 405B |
| primus/backends/maxtext/train_utils.py | Refactored emergency checkpoint logic and updated to use max_num_checkpoints_to_keep |
| primus/backends/maxtext/train.py | Major refactor with barrier synchronization, improved error handling, and new training features |
| primus/backends/maxtext/metric_logger.py | Updated to use MetadataKey enum constants |
| primus/backends/maxtext/max_utils.py | Added JAX distributed initialization functions for GPU/CPU/TPU |
| primus/backends/maxtext/layers/moe.py | Updated MoE layer to pass bias parameters |
| primus/backends/maxtext/layers/mixtral.py | New Primus-specific Mixtral decoder layer implementation |
| primus/backends/maxtext/layers/mistral.py | New Primus-specific Mistral decoder layer implementation |
| primus/backends/maxtext/layers/llama2.py | New Primus-specific Llama2 decoder layer implementation |
| primus/backends/maxtext/layers/gemma2.py | New Primus-specific Gemma2 decoder layer implementation |
| primus/backends/maxtext/layers/gemma.py | New Primus-specific Gemma decoder layer implementation |
| primus/backends/maxtext/layers/attentions.py | Removed entire attention implementation file |
| primus/backends/maxtext/layers/attention_op.py | Enhanced CUDNN Flash Attention with packing and context parallelism support |
| primus/backends/maxtext/input_pipeline/custom_packed_batch.py | Removed custom packing implementation |
| primus/backends/maxtext/input_pipeline/_hf_data_processing.py | Updated to use grain's native packing and added instruction format conversion |
| primus/backends/maxtext/configs/types.py | New Primus-specific MaxText config with WandB and Turbo support |
| primus/backends/maxtext/checkpointing.py | Added comprehensive checkpoint loading logic with single replica support |
| examples/run_slurm_pretrain.sh | Added NODE_LIST support and timestamped log files |
| examples/run_pretrain.sh | Reorganized AINIC configuration and updated XLA flags |
| examples/run_local_pretrain.sh | Updated default Docker image to maxtext-v26.1 |
| examples/maxtext/configs/MI355X/mixtral_8x7B-pretrain.yaml | Reduced batch size from 12 to 11 |
| examples/maxtext/configs/MI355X/llama3.1_405B-pretrain.yaml | New training configuration for Llama 3.1 405B model |
| examples/maxtext/configs/MI300X/mixtral_8x7B-pretrain.yaml | Updated remat policy |
| docs/cli/PRIMUS-CLI-GUIDE.md | Updated documentation with AINIC configuration examples and corrected command syntax |
Comments suppressed due to low confidence (4)
runner/primus-cli-direct.sh:1 - Array index arithmetic should use proper bash syntax. The expression `$((i+1))` correctly increments `i`, but when used inside an array subscript it should be written as `${args[i+1]}` without the extra parentheses, or the current form needs validation that `i+1` is within array bounds before access.
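A small sketch of the pattern this comment suggests. The `--nodes` option and the `args` contents are made up for illustration; the point is that bash evaluates arithmetic inside indexed-array subscripts, so `${args[i+1]}` is valid, and the bound is checked before the read:

```shell
#!/usr/bin/env bash
# Walk an argument array, reading an option's value with a bounds check.
args=(--config runner/maxtext-test.yaml --nodes 2)

i=0
while [ "$i" -lt "${#args[@]}" ]; do
  case "${args[i]}" in
    --nodes)
      # Verify i+1 is within bounds before accessing the value slot.
      if [ $((i+1)) -lt "${#args[@]}" ]; then
        nodes="${args[i+1]}"   # subscripts are arithmetic contexts already
      fi
      ;;
  esac
  i=$((i+1))
done
echo "nodes=$nodes"
```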
runner/primus-cli-direct.sh:1 - Python code embedded in a bash script should properly close file handles. The `open('$cfg_path')` should be wrapped in a context manager using `with open('$cfg_path') as f: cfg = yaml.safe_load(f)` to ensure the file is properly closed even if an exception occurs.
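The suggested pattern, shown standalone rather than embedded in bash. This is a sketch assuming PyYAML is installed; the config contents and the temp-file setup are illustrative:

```python
import os
import tempfile

import yaml  # PyYAML

# Create a small stand-in config file for the demonstration.
with tempfile.NamedTemporaryFile("w", suffix=".yaml", delete=False) as tmp:
    tmp.write("framework: jax\nnodes: 2\n")
    cfg_path = tmp.name

# Before: cfg = yaml.safe_load(open(cfg_path))  # handle leaks on error
# After: the context manager closes the file even if parsing raises.
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

os.unlink(cfg_path)
print(cfg["framework"])  # → jax
```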
primus/backends/maxtext/max_utils.py:1 - Operator precedence issue: the condition mixes `or` and `and` without parentheses. Due to operator precedence, this evaluates as `(self.wandb_save_dir is None) or (self.wandb_save_dir == '' and self.base_output_directory)`, which may not be the intended logic. Add explicit parentheses: `if (self.wandb_save_dir is None or self.wandb_save_dir == '') and self.base_output_directory:`
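A minimal demonstration of the divergence: with `wandb_save_dir = None` and an empty `base_output_directory`, the unparenthesized form short-circuits to `True` on the `or` while the parenthesized form correctly falls through. Plain variables stand in for the `self.` attributes:

```python
# In Python, `and` binds tighter than `or`, so the unparenthesized
# condition groups as:  A is None  or  (A == '' and B)
wandb_save_dir = None        # "unset" save dir
base_output_directory = ""   # also unset: the guarded branch must NOT run

# Buggy form: True as soon as wandb_save_dir is None, ignoring B entirely.
buggy = wandb_save_dir is None or wandb_save_dir == "" and base_output_directory

# Fixed form with explicit parentheses, as the comment suggests.
fixed = (wandb_save_dir is None or wandb_save_dir == "") and base_output_directory

print(bool(buggy), bool(fixed))  # → True False
```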
###############################################################################
primus/backends/maxtext/max_utils.py:1 - Same operator precedence issue as above. Should be: `if (self.wandb_exp_name is None or self.wandb_exp_name == '') and self.run_name:`
###############################################################################
Accept Copilot commit suggestion. Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…le model override args set